A Florida health insurance company wants to predict annual claims for individual clients. The company pulls a random sample of 50 customers. The owner wishes to charge an actuarially fair premium to ensure a normal rate of return. The owner collects all of their current customers’ health care expenses from the last year and compares them with what is known about each customer’s plan.
The data on the 50 customers in the sample is as follows:
Answer the following questions using complete sentences and attach all output, plots, etc. within this report.
Randomly select three observations from the sample and exclude from all modeling (i.e. n=47). Provide the summary statistics (min, max, std, mean, median) of the quantitative variables for the 47 observations.
| Characteristic | N = 47 |
|---|---|
| Charges | |
| Mean (SD) | 12,317 (11,498) |
| Median (IQR) | 8,604 (4,480, 13,552) |
| Range | 2,494, 55,135 |
| Age | |
| Mean (SD) | 42 (13) |
| Median (IQR) | 43 (30, 53) |
| Range | 23, 64 |
| BMI | |
| Mean (SD) | 29.0 (5.6) |
| Median (IQR) | 28.5 (25.3, 32.4) |
| Range | 16.8, 42.1 |
Children (summary, followed by the standard deviation):
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 1.000 1.234 2.000 5.000
## [1] 1.18345
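The hold-out split and the summary statistics could be produced along the following lines. This is a minimal sketch, not the code used for the report: the data frame `insurance` is a simulated stand-in (the real 50-customer table is not reproduced here), and the seed and column values are assumptions made only so the example runs on its own.

```r
# Hypothetical stand-in for the 50-customer sample (simulated values only)
set.seed(1)
insurance <- data.frame(
  Charges  = round(exp(rnorm(50, mean = 9.1, sd = 0.7))),
  Age      = sample(23:64, 50, replace = TRUE),
  BMI      = round(rnorm(50, mean = 29, sd = 5.5), 1),
  Children = sample(0:5, 50, replace = TRUE)
)

# Randomly withhold three observations for later out-of-sample testing
test_rows      <- sample(nrow(insurance), 3)
insurance.test <- insurance[test_rows, ]
insurance.new  <- insurance[-test_rows, ]

# min, max, sd, mean, and median for each quantitative variable (n = 47)
summary_stats <- sapply(insurance.new, function(x) {
  c(min = min(x), max = max(x), sd = sd(x), mean = mean(x), median = median(x))
})
print(round(summary_stats, 2))
```

Fixing the seed before sampling keeps the same three observations withheld every time the report is re-knit.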
Provide the correlation between all quantitative variables
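A correlation matrix for the quantitative variables can be obtained with `cor()`. The sketch below uses simulated data with the same column names as the report (an assumption, since the sample itself is not shown), and it leaves out the dummy variables because they are not quantitative.

```r
# Simulated stand-in for the quantitative columns of the 47-row sample
set.seed(2)
n <- 47
quant <- data.frame(
  Age      = sample(23:64, n, replace = TRUE),
  BMI      = round(rnorm(n, mean = 29, sd = 5.5), 1),
  Children = sample(0:5, n, replace = TRUE)
)
quant$Charges <- round(exp(7 + 0.035 * quant$Age + 0.012 * quant$BMI +
                           rnorm(n, sd = 0.5)))

# Pearson correlation between every pair of quantitative variables
cor_matrix <- cor(quant)
print(round(cor_matrix, 3))
```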
Run a regression that includes all independent variables in the data table. Does the model above violate any of the Gauss-Markov assumptions? If so, what are they and what is the solution for correcting?
##
## Call:
## lm(formula = Charges ~ ., data = insurance.new)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11888 -2726 -1065 711 20257
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -14022.39 6563.47 -2.136 0.039145 *
## Age 287.26 77.04 3.729 0.000626 ***
## BMI 434.97 200.14 2.173 0.036058 *
## Female 858.33 2120.59 0.405 0.687923
## Children 118.17 873.64 0.135 0.893122
## Smoker 23108.13 3009.97 7.677 3.04e-09 ***
## WinterSprings -1659.04 3069.60 -0.540 0.592024
## WinterPark -4853.57 3009.55 -1.613 0.115080
## Oviedo -3769.38 2566.29 -1.469 0.150115
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6722 on 38 degrees of freedom
## Multiple R-squared: 0.7176, Adjusted R-squared: 0.6582
## F-statistic: 12.07 on 8 and 38 DF, p-value: 2.224e-08
We ran a regression on all independent variables and found the following violations of the Gauss-Markov Theorem assumptions:
3rd Assumption - Non-linearity in the functional form (Residuals vs. Fitted plot).
- Consider using ratios or percentages rather than raw data (see module on multicollinearity for a complete discussion of the associated problems and causes).
4th Assumption - Heteroskedasticity is occurring (Scale-Location plot).
- There is a cluster of observations between fitted values of about 2,500 and 15,000, after which the residuals fan outwards. This results in inefficient cross-section estimates.
6th Assumption - The residuals are not normally distributed (Normal Q-Q plot).
- Look for subgroups in the data and analyze them separately; use summary data (like the mean value) rather than the raw data.
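The visual checks above can be backed up with simple numeric checks. The sketch below is illustrative only: it fits a levels regression to simulated data (the variable names and coefficients are assumptions), then runs a Shapiro-Wilk test on the residuals and a Breusch-Pagan-style auxiliary regression of the squared residuals on the fitted values. It is not the procedure used in the report, which relied on the diagnostic plots.

```r
# Simulated data with multiplicative errors, which produces skewed,
# fanning residuals when the model is fit in levels
set.seed(3)
n   <- 47
sim <- data.frame(
  Age    = sample(23:64, n, replace = TRUE),
  Smoker = rbinom(n, 1, 0.2)
)
sim$Charges <- exp(7 + 0.035 * sim$Age + 1.3 * sim$Smoker + rnorm(n, sd = 0.4))

fit_levels <- lm(Charges ~ Age + Smoker, data = sim)

# The three plots referred to above can be drawn with:
# plot(fit_levels, which = 1:3)  # Residuals vs Fitted, Normal Q-Q, Scale-Location

# Normality of residuals: a small p-value suggests non-normal residuals
sw <- shapiro.test(residuals(fit_levels))

# Heteroskedasticity: regress squared residuals on fitted values and use
# n * R-squared, compared against a chi-square with 1 degree of freedom
e2      <- residuals(fit_levels)^2
yhat    <- fitted(fit_levels)
aux     <- lm(e2 ~ yhat)
bp_stat <- n * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

print(c(shapiro_p = sw$p.value, bp_stat = bp_stat, bp_p = bp_p))
```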
Implement the solutions from question 3, such as data transformation, along with any other changes you wish. Use the sample data and run a new regression. How have the fit measures changed? How have the signs and significance of the coefficients changed?
##
## Call:
## lm(formula = LogCharges ~ ., data = insurance.new[, c(10, 2:9)])
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.65510 -0.14862 -0.05322 0.03263 1.28444
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.033276 0.387771 18.138 < 2e-16 ***
## Age 0.034991 0.004552 7.688 2.94e-09 ***
## BMI 0.011547 0.011824 0.977 0.335
## Female 0.054880 0.125285 0.438 0.664
## Children 0.063550 0.051615 1.231 0.226
## Smoker 1.324284 0.177829 7.447 6.16e-09 ***
## WinterSprings -0.007282 0.181353 -0.040 0.968
## WinterPark -0.051822 0.177804 -0.291 0.772
## Oviedo -0.144341 0.151617 -0.952 0.347
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3972 on 38 degrees of freedom
## Multiple R-squared: 0.7924, Adjusted R-squared: 0.7487
## F-statistic: 18.13 on 8 and 38 DF, p-value: 8.493e-11
##
## Call:
## lm(formula = LogCharges ~ ., data = insurance_LogChrgAgeWDummy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.58853 -0.17786 -0.05451 0.02616 1.27653
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.17330 0.73015 4.346 9.98e-05 ***
## LogAge 1.42556 0.18545 7.687 2.95e-09 ***
## BMI 0.01451 0.01178 1.232 0.225
## Female 0.06560 0.12535 0.523 0.604
## Children 0.05664 0.05168 1.096 0.280
## Smoker 1.32511 0.17782 7.452 6.07e-09 ***
## WinterSprings -0.02476 0.18155 -0.136 0.892
## WinterPark -0.07879 0.17815 -0.442 0.661
## Oviedo -0.14899 0.15168 -0.982 0.332
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3972 on 38 degrees of freedom
## Multiple R-squared: 0.7924, Adjusted R-squared: 0.7487
## F-statistic: 18.13 on 8 and 38 DF, p-value: 8.507e-11
##
## Call:
## lm(formula = LogCharges ~ ., data = insurance.new[, c(12, 2:10)])
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.62995 -0.14987 -0.05370 0.02717 1.28495
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.7322920 0.9363606 7.190 1.59e-08 ***
## AgeSq -0.0001643 0.0004640 -0.354 0.725
## Age 0.0492269 0.0404749 1.216 0.232
## BMI 0.0124770 0.0122478 1.019 0.315
## Female 0.0605778 0.1277695 0.474 0.638
## Children 0.0598072 0.0532787 1.123 0.269
## Smoker 1.3245151 0.1799132 7.362 9.39e-09 ***
## WinterSprings -0.0149998 0.1847672 -0.081 0.936
## WinterPark -0.0626046 0.1824473 -0.343 0.733
## Oviedo -0.1470754 0.1535865 -0.958 0.344
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4018 on 37 degrees of freedom
## Multiple R-squared: 0.7931, Adjusted R-squared: 0.7428
## F-statistic: 15.76 on 9 and 37 DF, p-value: 3.566e-10
##
## Call:
## lm(formula = LogCharges ~ ., data = insurance.new[, c(2, 4:10,
## 13)])
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.65610 -0.15185 -0.05397 0.02865 1.27595
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.313609 1.106364 5.707 1.44e-06 ***
## Age 0.034944 0.004564 7.656 3.25e-09 ***
## Female 0.056410 0.125504 0.449 0.656
## Children 0.064999 0.051857 1.253 0.218
## Smoker 1.323267 0.177896 7.438 6.32e-09 ***
## WinterSprings -0.005992 0.181873 -0.033 0.974
## WinterPark -0.045362 0.176489 -0.257 0.799
## Oviedo -0.140444 0.151103 -0.929 0.359
## LogBMI 0.314013 0.330373 0.950 0.348
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3974 on 38 degrees of freedom
## Multiple R-squared: 0.7921, Adjusted R-squared: 0.7484
## F-statistic: 18.1 on 8 and 38 DF, p-value: 8.695e-11
##
## Call:
## lm(formula = LogCharges ~ ., data = insurance.new[, c(2:10, 14)])
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.65467 -0.14654 -0.04853 0.03424 1.28639
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.111e+00 1.356e+00 5.245 6.61e-06 ***
## Age 3.502e-02 4.643e-03 7.543 5.43e-09 ***
## BMI 6.116e-03 9.201e-02 0.066 0.947
## Female 5.393e-02 1.280e-01 0.422 0.676
## Children 6.287e-02 5.354e-02 1.174 0.248
## Smoker 1.324e+00 1.802e-01 7.349 9.77e-09 ***
## WinterSprings -8.169e-03 1.844e-01 -0.044 0.965
## WinterPark -5.396e-02 1.837e-01 -0.294 0.771
## Oviedo -1.452e-01 1.542e-01 -0.941 0.353
## BMISq 9.296e-05 1.562e-03 0.060 0.953
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4025 on 37 degrees of freedom
## Multiple R-squared: 0.7924, Adjusted R-squared: 0.7419
## F-statistic: 15.69 on 9 and 37 DF, p-value: 3.779e-10
When applying the solutions for the Gauss-Markov assumptions that were violated, we calculated and compared the following:
Overall, the measures of fit improved for every regression. The SEE fell from 6722 to around 0.40, although the two figures are not directly comparable because the log models measure error in log dollars rather than dollars. R-squared and Adjusted R-squared rose from about 72% and 66% to roughly 79% and 75% for all models.
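Below are the changes in coefficient significance and sign relative to the original regression. None of the slope coefficients changed sign; only the intercept went from negative to positive.
1. Log of Charges
- BMI is no longer significant
- Smoker remains highly significant (t-value essentially unchanged, 7.68 to 7.45)
- Age is now much more significant (t-value 3.73 to 7.69)
2. Log of Charges and Log of Age
- BMI is no longer significant
- Smoker remains highly significant
- Age (as LogAge) is now much more significant
3. Log of Charges and Age Squared
- BMI and Age are no longer significant (neither Age nor AgeSq is individually significant)
- Smoker remains highly significant
4. Log of Charges and Log of BMI
- BMI (as LogBMI) is no longer significant
- Age is more significant and Smoker remains highly significant
5. Log of Charges and BMI squared
- BMI is no longer significant, and BMISq is not significant either
- Age is more significant and Smoker remains highly significant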
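The transformed variables and the fit comparison can be sketched as follows. The data are simulated, and `model_LogChrg` and the reduced set of regressors are assumptions made to keep the example short; only the names `model_LogChrgAgeSq` and `model_LogChrgBMISq` appear elsewhere in this report.

```r
# Simulated stand-in for the 47-row modeling sample
set.seed(4)
n <- 47
insurance.new <- data.frame(
  Age    = sample(23:64, n, replace = TRUE),
  BMI    = round(runif(n, 17, 42), 1),
  Smoker = rbinom(n, 1, 0.2)
)
insurance.new$Charges <- exp(7 + 0.035 * insurance.new$Age +
                             0.012 * insurance.new$BMI +
                             1.3 * insurance.new$Smoker + rnorm(n, sd = 0.4))

# Transformations used by the candidate models
insurance.new$LogCharges <- log(insurance.new$Charges)
insurance.new$AgeSq      <- insurance.new$Age^2
insurance.new$BMISq      <- insurance.new$BMI^2

models <- list(
  LogChrg      = lm(LogCharges ~ Age + BMI + Smoker, data = insurance.new),
  LogChrgAgeSq = lm(LogCharges ~ Age + AgeSq + BMI + Smoker, data = insurance.new),
  LogChrgBMISq = lm(LogCharges ~ Age + BMI + BMISq + Smoker, data = insurance.new)
)

# Collect SEE, R-squared, and adjusted R-squared side by side
fit_table <- data.frame(
  see    = sapply(models, function(m) summary(m)$sigma),
  r2     = sapply(models, function(m) summary(m)$r.squared),
  adj_r2 = sapply(models, function(m) summary(m)$adj.r.squared)
)
print(round(fit_table, 4))
```

Adding a squared term can never lower R-squared, which is why the adjusted R-squared is the more useful column when comparing these models.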
Use the 3 withheld observations and calculate the performance measures for your best two models. Which is the better model? (remember that “better” depends on whether your outlook is short or long run)
library(magrittr)  # provides the %>% pipe used below (also exported by dplyr)
# Build the transformed variables that the log models expect
insurance.test$LogCharges <- log(insurance.test$Charges)
insurance.test$BMISq <- insurance.test$BMI^2
insurance.test$AgeSq <- insurance.test$Age^2
# Predict the withheld observations; the log models are exponentiated back to dollars
insurance.test$bad_model_pred <- predict(model, newdata = insurance.test)
insurance.test$model_1_pred <- predict(model_LogChrgBMISq, newdata = insurance.test) %>% exp()
insurance.test$model_2_pred <- predict(model_LogChrgAgeSq, newdata = insurance.test) %>% exp()
# Prediction error (predicted minus actual) for each model
insurance.test$error_bm <- insurance.test$bad_model_pred - insurance.test$Charges
insurance.test$error_1 <- insurance.test$model_1_pred - insurance.test$Charges
insurance.test$error_2 <- insurance.test$model_2_pred - insurance.test$Charges
## [1] 2096.91
## [1] 240.616
## [1] 356.8711
## [1] 5282.157
## [1] 412.3407
## [1] 512.8377
## [1] 6720.431
## [1] 429.0247
## [1] 584.066
## [1] 0.6206971
## [1] 0.07086708
## [1] 0.07259645
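The twelve values printed above appear to be four performance measures for each of the three models (original model, Model 1, Model 2). Assuming they are the bias, MAE, RMSE, and MAPE, in that order, they could be computed with a helper like the one below. The function name `performance` and the three example observations are hypothetical.

```r
# Performance measures from actual values and predictions (both in dollars)
performance <- function(actual, predicted) {
  error <- predicted - actual          # same sign convention as the code above
  c(
    bias = mean(error),                # average error
    mae  = mean(abs(error)),           # mean absolute error
    rmse = sqrt(mean(error^2)),        # root mean squared error
    mape = mean(abs(error) / actual)   # mean absolute percentage error (fraction)
  )
}

# Hypothetical example with three withheld observations
actual    <- c(100, 200, 400)
predicted <- c(110, 180, 400)
perf <- performance(actual, predicted)
print(round(perf, 4))
```

With only three withheld observations, each measure is driven by very few errors, so small differences between models should be read with caution.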
The initial model performed the worst of the three on every measure. Comparing the two log models, Model 1 (log of Charges with BMI squared) has the lower bias, MAE, RMSE, and MAPE, so it made no large prediction errors relative to Model 2 (log of Charges with Age squared). Because Model 1 is better on every measure, it is the better choice whether the outlook is short run or long run.
Provide interpretations of the coefficients, do the signs make sense? Perform marginal change analysis (thing 2) on the independent variables.
##
## Call:
## lm(formula = LogCharges ~ ., data = insurance.new[, c(2:10, 14)])
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.65467 -0.14654 -0.04853 0.03424 1.28639
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.111e+00 1.356e+00 5.245 6.61e-06 ***
## Age 3.502e-02 4.643e-03 7.543 5.43e-09 ***
## BMI 6.116e-03 9.201e-02 0.066 0.947
## Female 5.393e-02 1.280e-01 0.422 0.676
## Children 6.287e-02 5.354e-02 1.174 0.248
## Smoker 1.324e+00 1.802e-01 7.349 9.77e-09 ***
## WinterSprings -8.169e-03 1.844e-01 -0.044 0.965
## WinterPark -5.396e-02 1.837e-01 -0.294 0.771
## Oviedo -1.452e-01 1.542e-01 -0.941 0.353
## BMISq 9.296e-05 1.562e-03 0.060 0.953
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4025 on 37 degrees of freedom
## Multiple R-squared: 0.7924, Adjusted R-squared: 0.7419
## F-statistic: 15.69 on 9 and 37 DF, p-value: 3.779e-10
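Using a 95% confidence level, and because the dependent variable is the log of Charges, the coefficients are interpreted as approximate percentage changes in charges:
- If a person’s Age increases by 1 year, log charges increase by 0.04, give or take 0.01, which is roughly a 3.5% increase in charges.
- If a person is a smoker, log charges increase by 1.32, give or take 0.38, so charges are roughly 3.8 times those of an otherwise identical non-smoker (about 276% higher).
Both signs are positive, which is consistent with the higher charges for older customers and smokers seen in the original regression.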
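Because the dependent variable is the log of Charges, a coefficient b corresponds to a change of 100 * (exp(b) - 1) percent in charges. The sketch below applies that conversion to the Age and Smoker estimates and standard errors from the output above; the 37 degrees of freedom come from the same output, and the object names are illustrative.

```r
# Estimates and standard errors copied from the regression output above
b  <- c(Age = 0.03502,  Smoker = 1.324)
se <- c(Age = 0.004643, Smoker = 0.1802)

# 95% critical value for 37 residual degrees of freedom
t_crit <- qt(0.975, df = 37)

# Convert log-scale effects (and their interval bounds) to percent changes
effects <- data.frame(
  pct_change = 100 * (exp(b) - 1),
  pct_lower  = 100 * (exp(b - t_crit * se) - 1),
  pct_upper  = 100 * (exp(b + t_crit * se) - 1)
)
print(round(effects, 1))
```

For small coefficients such as Age, the percent change is close to 100 times the coefficient; for large ones such as Smoker, the exact conversion matters.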
An eager insurance representative comes back with five potential clients. Using the better of the two models selected above, provide the prediction intervals for the five potential clients using the information provided by the insurance rep.
| Customer | Age | BMI | Female | Children | Smoker | City |
|---|---|---|---|---|---|---|
| 1 | 60 | 22 | 1 | 0 | 0 | Oviedo |
| 2 | 40 | 30 | 0 | 1 | 0 | Sanford |
| 3 | 25 | 25 | 0 | 0 | 1 | Winter Park |
| 4 | 33 | 35 | 1 | 2 | 0 | Winter Springs |
| 5 | 45 | 27 | 1 | 3 | 0 | Oviedo |
##
## Call:
## lm(formula = LogCharges ~ ., data = insurance.new[, c(2:10, 14)])
##
## Coefficients:
## (Intercept) Age BMI Female Children
## 7.111e+00 3.502e-02 6.116e-03 5.393e-02 6.287e-02
## Smoker WinterSprings WinterPark Oviedo BMISq
## 1.324e+00 -8.169e-03 -5.396e-02 -1.452e-01 9.296e-05
## fit lwr upr
## 1 10940.686 4345.449 27545.74
## 2 6915.164 2941.044 16259.36
## 3 12933.267 4787.337 34939.97
## 4 6410.797 2541.912 16168.27
## 5 8240.879 3380.672 20088.34
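Prediction intervals like those above can be obtained with `predict(..., interval = "prediction")` on the log scale and then exponentiated back to dollars. The sketch below is a simplified illustration: the training data are simulated, Female and Children are left out for brevity, and Sanford is assumed to be the baseline city because it has no dummy in the regression output.

```r
# Simulated training data with city dummies (Sanford as the baseline)
set.seed(7)
n    <- 47
city <- rep(c("Sanford", "Oviedo", "Winter Park", "Winter Springs"), length.out = n)
train <- data.frame(
  Age           = sample(23:64, n, replace = TRUE),
  BMI           = round(runif(n, 17, 42), 1),
  Smoker        = rep(c(0, 0, 0, 0, 1), length.out = n),
  Oviedo        = as.integer(city == "Oviedo"),
  WinterPark    = as.integer(city == "Winter Park"),
  WinterSprings = as.integer(city == "Winter Springs")
)
train$BMISq      <- train$BMI^2
train$LogCharges <- 7 + 0.035 * train$Age + 0.012 * train$BMI +
                    1.3 * train$Smoker + rnorm(n, sd = 0.4)
fit <- lm(LogCharges ~ ., data = train)

# The five potential clients, with City converted to the same dummies
clients <- data.frame(
  Age    = c(60, 40, 25, 33, 45),
  BMI    = c(22, 30, 25, 35, 27),
  Smoker = c(0, 0, 1, 0, 0),
  City   = c("Oviedo", "Sanford", "Winter Park", "Winter Springs", "Oviedo")
)
clients$Oviedo        <- as.integer(clients$City == "Oviedo")
clients$WinterPark    <- as.integer(clients$City == "Winter Park")
clients$WinterSprings <- as.integer(clients$City == "Winter Springs")
clients$BMISq         <- clients$BMI^2

# 95% prediction intervals on the log scale, converted back to dollars
pi_log     <- predict(fit, newdata = clients, interval = "prediction", level = 0.95)
pi_dollars <- exp(pi_log)
width      <- pi_dollars[, "upr"] - pi_dollars[, "lwr"]
print(round(cbind(pi_dollars, width), 0))
```

Exponentiating the interval end points keeps the lower bound above zero, which a dollar-scale model would not guarantee.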
The owner notices that some of the predictions are wider than others, explain why.
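The widest interval among the five customers belongs to customer #3, a 25 year old male smoker with no children living in Winter Park. The second widest belongs to customer #1, a 60 year old female with no children living in Oviedo. Age and Smoker have the most significant effect on Charges, so these two customers have the highest predicted charges. Because the model predicts the log of Charges, the interval is multiplicative around the fit once it is converted back to dollars, so higher predicted charges tend to produce wider dollar intervals.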
Are there any prediction problems that occur with the five potential clients? If so, explain.
No serious prediction problems occur with the five potential clients. Age and Smoker are significantly related to Charges, and Model #1 has an R-squared of about 80%. The one potential concern is Customer #4, whose BMI of 35 is the highest of the group and is above the sample mean (29.0) and median (28.5), which could indicate that Customer #4 is an outlier.